In [1]:
import pandas as pd
from IPython.display import HTML
import base64, io, IPython
from PIL import Image as PILImage
from IPython.display import Image
from IPython import display

Name:¶

TJ Sipin

Contributors:¶

Preeti Kulkarni and Gian Taparan

Mini project 2: primary productivity in coastal waters¶

In this project you're again given a dataset and some questions. The data for this project come from the EPA's National Aquatic Resource Surveys, and in particular the National Coastal Condition Assessment (NCCA); broadly, you'll do an exploratory analysis of primary productivity in coastal waters.

By way of background, chlorophyll A is often used as a proxy for primary productivity in marine ecosystems; primary producers are important because they are at the base of the food web. Nitrogen and phosphorus are key nutrients that stimulate primary production.

In the data folder you'll find water chemistry data, site information, and metadata files. It might be helpful to keep the metadata files open when tidying up the data for analysis. It might also be helpful to keep in mind that these datasets contain a considerable amount of information, not all of which is relevant to answering the questions of interest. Notice that the questions pertain somewhat narrowly to just a few variables. It's recommended that you determine which variables might be useful and drop the rest.

As in the first mini project, there are accurate answers to each question that are mutually consistent with the data, but there aren't uniquely correct answers. You will likely notice that you have even more latitude in this project than in the first, as the questions are slightly broader. Since we've been emphasizing visual and exploratory techniques in class, you are encouraged (but not required) to support your answers with graphics.

The broader goal of these mini projects is to cultivate your problem-solving ability in an unstructured setting. Your work will be evaluated based on the following:

  • choice of method(s) used to answer questions;
  • clarity of presentation;
  • code style and documentation.

Please write up your results separately from your codes; codes should be included at the end of the notebook.


Part 1: dataset¶

Merge the site information with the chemistry data and tidy it up. Determine which columns to keep based on what you use in answering the questions in part 2; then, print the first few rows here (but do not include your codes used in tidying the data) and write a brief description (1-2 paragraphs) of the dataset conveying what you take to be the key attributes. Direct your description to a reader unfamiliar with the data; ensure that in your data preview the columns are named intelligibly.

Suggestion: export your cleaned data as a separate .csv file and read that directly in below, as in: pd.read_csv('YOUR DATA FILE').head().

In [2]:
# show a few rows of clean data

pd.read_csv('data').head()
Out[2]:
UID State Date collected Waterbody name Region Water depth (in meters) Latitude Longitude Province Ammonia Chlorophyll A Dissolved Inorganic Nitrogen Dissolved Inorganic Phosphate Nitrate/Nitrite Total Nitrogen Total Phosphorus
0 59 CA 7/1/2010 Mission Bay West 2.5 32.77361 -117.21471 Californian Province 0.000 3.34 0.014 0.028 0.014 0.40750 0.061254
1 60 CA 7/1/2010 San Diego Bay West 3.5 32.71424 -117.23527 Californian Province 0.010 2.45 0.020 0.026 0.010 0.23000 0.037379
2 61 CA 7/1/2010 Mission Bay West 2.2 32.78372 -117.22132 Californian Province 0.000 3.82 0.009 0.030 0.009 0.33625 0.048100
3 62 CA 7/1/2010 San Diego Bay West 9.5 32.72245 -117.20443 Californian Province 0.000 6.13 0.010 0.028 0.010 0.23875 0.044251
4 63 NC 6/9/2010 White Oak River Southeast 1.0 34.75098 -77.12117 Carolinian Province 0.002 9.79 0.030 0.043 0.028 0.63250 0.090636

The dataset above contains amounts of various nutrients (like ammonia, nitrogen, and phosphate) and levels of productivity (via chlorophyll A levels) in several bodies of water across the US during the summer months of 2010. Each observation includes date collected as well as longitude/latidude.

Part 2: exploratory analysis¶

Answer each question below and provide a visualization supporting your answer. A description and interpretation of the visualization should be offered.

Comment: you can either designate your plots in the codes section with clear names and reference them in your answers; or you can export your plots as image files and display them in markdown cells.

What is the apparent relationship between nutrient availability and productivity?¶

Comment: it's fine to examine each nutrient -- nitrogen and phosphorus -- separately, but do consider whether they might be related to each other.

Total nitrogen and total phosphorous are moderately correlated with each other. Chlorophyll A has moderate-high correlation with nutrient availability.

Are there any notable differences in available nutrients among U.S. coastal regions?¶

Generally, the west has slightly higher phosphorus and ammonia levels, while the east has higher nitrogen levels. Upon further investigation, the west may have higher phosphorus levels due to agricultural malpractice, such as high runoff rates.

The west also has lower variability in ammonia levels, whereas the east has many outliers, specifically the northeast.

One can see that for nitrogen and phosphorus, but nitrogen especially, the Great Lakes tend to have many more outliers than the other regions. This may be due to a lower flow of fresh water to carry nutrients compared to the currents of the oceans, which allows higher opportunity for buildup of nutrients.

For the Gulf, the distribution is incredibly gaussian, but there is one outlier that stands apart from the rest. This would be interesting to investigate.

It must be noted that all but one outlier are in the upper-tail.

Based on the 2010 data, does productivity seem to vary geographically in some way?¶

If so, explain how; If not, explain what options you considered and ruled out.

Just like for the nutrients, the Great Lakes region features the most outliers with the lowest median chlorophyll A levels. However, each region features many outliers. The west does not seem to have high chlorophyll concentrations compared to the east and the gulf.

This is most likely due to the corresponding nutrient levels.

How does primary productivity in California coastal waters change seasonally in 2010, if at all?¶

Does your result make intuitive sense?

There is a slight upward trend, though the variance is high. This might make sense as harvesting in California is typically done in the autumn, so there is a higher risk for fertilizer runoff as mentioned before. This runoff of fertilizer, which is high in phosphorus, may be responsible for the high productivity levels.

Pose and answer one additional question: What's the relationship between water depth and productivity?¶

There seems to be a generally negative relationship in chlorophyll a levels and water depth, with intensely high levels in shallow watters (< 20 meters).


Codes¶

In [3]:
import pandas as pd
import numpy as np
import altair as alt

ncca_raw = pd.read_csv('assessed_ncca2010_waterchem.csv')
ncca_sites = pd.read_csv('assessed_ncca2010_siteinfo.csv')
In [4]:
# see how many different units of measurements for ammonia
ncca_raw[ncca_raw.PARAMETER_NAME == 'Ammonia'].UNITS.describe()
Out[4]:
count       1091
unique         1
top       mg N/L
freq        1091
Name: UNITS, dtype: object

1¶

In [5]:
ncca_raw.head()
Out[5]:
UID SITE_ID STATE DATE_COL BATCH_ID PARAMETER PARAMETER_NAME RESULT UNITS MDL MRL PQL DATE_ANALYZED HOLDING_TIME QACODE LAB_SAMPLE_ID SAMPLE_ID METHOD
0 59 NCCA10-1111 CA 7/1/2010 100714.1 NTL Total Nitrogen 0.407500 mg N/L 0.0150 0.0300 NaN 7/14/2010 13.0 NaN 1010242 568671.0 NaN
1 59 NCCA10-1111 CA 7/1/2010 100708.1 NO3NO2 Nitrate/Nitrite 0.014000 mg N/L 0.0020 0.0040 NaN 7/8/2010 7.0 NaN 1010242 568673.0 NaN
2 59 NCCA10-1111 CA 7/1/2010 100708.1 SRP Dissolved Inorganic Phosphate 0.028000 mg P/L 0.0027 0.0054 NaN 7/8/2010 7.0 NaN 1010242 568673.0 NaN
3 59 NCCA10-1111 CA 7/1/2010 IM_CALCULATED DIN Dissolved Inorganic Nitrogen 0.014000 mg N/L NaN NaN NaN NaN NaN Q23 NaN NaN NaN
4 59 NCCA10-1111 CA 7/1/2010 100714.1 PTL Total Phosphorus 0.061254 mg P/L 0.0012 0.0024 NaN 7/14/2010 13.0 NaN 1010242 568671.0 NaN
In [6]:
# variables to keep in merged data frame
raw_vars = ['UID', 'STATE', 'DATE_COL', 
            'PARAMETER_NAME', 'RESULT']
sites_vars = ['WTBDY_NM', 'NCCR_REG',
              'STATION_DEPTH', 'ALAT_DD', 
              'ALON_DD']
vars_to_keep = raw_vars + sites_vars
In [7]:
# merge ncca_raw and ncca_sites
data_mod1 = pd.merge(ncca_raw, ncca_sites,
                       how='right',
                       on = 'UID'
                     )
data_mod1.head()
Out[7]:
UID SITE_ID_x STATE_x DATE_COL_x BATCH_ID PARAMETER PARAMETER_NAME RESULT UNITS MDL ... NPSPARK PANEL STATUS10 STRATUM TNT WGT_CAT WGT_NCCA10 RSRC_CLASS QA_CODES COMMENT
0 59 NCCA10-1111 CA 7/1/2010 100714.1 NTL Total Nitrogen 0.407500 mg N/L 0.0150 ... NaN Base Target_Sampled CalP_Other Target NCA_CA_CalP_Other 2.503632 NCA_Estuarine_Coastal NaN NaN
1 59 NCCA10-1111 CA 7/1/2010 100708.1 NO3NO2 Nitrate/Nitrite 0.014000 mg N/L 0.0020 ... NaN Base Target_Sampled CalP_Other Target NCA_CA_CalP_Other 2.503632 NCA_Estuarine_Coastal NaN NaN
2 59 NCCA10-1111 CA 7/1/2010 100708.1 SRP Dissolved Inorganic Phosphate 0.028000 mg P/L 0.0027 ... NaN Base Target_Sampled CalP_Other Target NCA_CA_CalP_Other 2.503632 NCA_Estuarine_Coastal NaN NaN
3 59 NCCA10-1111 CA 7/1/2010 IM_CALCULATED DIN Dissolved Inorganic Nitrogen 0.014000 mg N/L NaN ... NaN Base Target_Sampled CalP_Other Target NCA_CA_CalP_Other 2.503632 NCA_Estuarine_Coastal NaN NaN
4 59 NCCA10-1111 CA 7/1/2010 100714.1 PTL Total Phosphorus 0.061254 mg P/L 0.0012 ... NaN Base Target_Sampled CalP_Other Target NCA_CA_CalP_Other 2.503632 NCA_Estuarine_Coastal NaN NaN

5 rows × 48 columns

In [8]:
vars_to_keep_1 = ['UID', 'STATE_x', 'DATE_COL_x', 
                   'PARAMETER_NAME', 'RESULT','WTBDY_NM',
                   'NCCR_REG', 'STATION_DEPTH', 'ALAT_DD', 
                   'ALON_DD', 'PROVINCE']
In [9]:
data_mod2 = data_mod1.loc[:,vars_to_keep_1]
In [10]:
data_mod2.head()
Out[10]:
UID STATE_x DATE_COL_x PARAMETER_NAME RESULT WTBDY_NM NCCR_REG STATION_DEPTH ALAT_DD ALON_DD PROVINCE
0 59 CA 7/1/2010 Total Nitrogen 0.407500 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
1 59 CA 7/1/2010 Nitrate/Nitrite 0.014000 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
2 59 CA 7/1/2010 Dissolved Inorganic Phosphate 0.028000 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
3 59 CA 7/1/2010 Dissolved Inorganic Nitrogen 0.014000 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
4 59 CA 7/1/2010 Total Phosphorus 0.061254 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
In [11]:
# keep rows with states
data_mod3 = data_mod2[data_mod2.STATE_x.notna()]
data_mod3
Out[11]:
UID STATE_x DATE_COL_x PARAMETER_NAME RESULT WTBDY_NM NCCR_REG STATION_DEPTH ALAT_DD ALON_DD PROVINCE
0 59 CA 7/1/2010 Total Nitrogen 0.407500 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
1 59 CA 7/1/2010 Nitrate/Nitrite 0.014000 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
2 59 CA 7/1/2010 Dissolved Inorganic Phosphate 0.028000 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
3 59 CA 7/1/2010 Dissolved Inorganic Nitrogen 0.014000 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
4 59 CA 7/1/2010 Total Phosphorus 0.061254 Mission Bay West 2.5 32.77361 -117.21471 Californian Province
... ... ... ... ... ... ... ... ... ... ... ...
7873 16731 CA 6/29/2010 Total Nitrogen 0.228750 San Diego Bay West 4.8 32.66243 -117.12712 Californian Province
7874 16731 CA 6/29/2010 Total Phosphorus 0.041821 San Diego Bay West 4.8 32.66243 -117.12712 Californian Province
7875 16731 CA 6/29/2010 Dissolved Inorganic Phosphate 0.033000 San Diego Bay West 4.8 32.66243 -117.12712 Californian Province
7876 16731 CA 6/29/2010 Ammonia 0.016000 San Diego Bay West 4.8 32.66243 -117.12712 Californian Province
7877 16731 CA 6/29/2010 Nitrate/Nitrite 0.012000 San Diego Bay West 4.8 32.66243 -117.12712 Californian Province

7876 rows × 11 columns

In [12]:
# pivot data frame so each parameter name is its own column
data_mod4 = data_mod3.pivot(
    index = data_mod3.drop(['PARAMETER_NAME', 'RESULT'], axis = 1).columns,
    columns = 'PARAMETER_NAME',
    values = 'RESULT'
).reset_index(
).rename_axis(
    columns = {'PARAMETER_NAME':''}
)
data_mod4
Out[12]:
UID STATE_x DATE_COL_x WTBDY_NM NCCR_REG STATION_DEPTH ALAT_DD ALON_DD PROVINCE Ammonia ... Nitrate Nitrate/Nitrite Nitrite Nitrogen Particulate Phosphorus Particulate Total Dissolved Nitrogen Total Dissolved Phosphorus Total Kjeldahl Nitrogen Total Nitrogen Total Phosphorus
0 59 CA 7/1/2010 Mission Bay West 2.5 32.77361 -117.21471 Californian Province 0.000 ... NaN 0.014 NaN NaN NaN NaN NaN NaN 0.407500 0.061254
1 60 CA 7/1/2010 San Diego Bay West 3.5 32.71424 -117.23527 Californian Province 0.010 ... NaN 0.010 NaN NaN NaN NaN NaN NaN 0.230000 0.037379
2 61 CA 7/1/2010 Mission Bay West 2.2 32.78372 -117.22132 Californian Province 0.000 ... NaN 0.009 NaN NaN NaN NaN NaN NaN 0.336250 0.048100
3 62 CA 7/1/2010 San Diego Bay West 9.5 32.72245 -117.20443 Californian Province 0.000 ... NaN 0.010 NaN NaN NaN NaN NaN NaN 0.238750 0.044251
4 63 NC 6/9/2010 White Oak River Southeast 1.0 34.75098 -77.12117 Carolinian Province 0.002 ... NaN 0.028 NaN NaN NaN NaN NaN NaN 0.632500 0.090636
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1087 16727 MI 6/18/2010 Lake Michigan Great Lakes 0.6 44.98607 -85.64046 Great Lakes Province 0.003 ... NaN 0.257 NaN NaN NaN NaN NaN NaN 0.380000 0.000000
1088 16728 MI 6/25/2010 Lake Michigan Great Lakes 2.3 44.94789 -85.94790 Great Lakes Province 0.005 ... NaN 0.230 NaN NaN NaN NaN NaN NaN 0.437625 0.006249
1089 16729 MI 6/16/2010 Lake Michigan Great Lakes 31.2 44.83721 -85.52862 Great Lakes Province 0.010 ... NaN 0.240 NaN NaN NaN NaN NaN NaN 0.361250 0.000000
1090 16730 CA 6/29/2010 San Diego Bay West 4.1 32.66443 -117.13879 Californian Province 0.017 ... NaN 0.011 NaN NaN NaN NaN NaN NaN 0.213000 0.044127
1091 16731 CA 6/29/2010 San Diego Bay West 4.8 32.66243 -117.12712 Californian Province 0.016 ... NaN 0.012 NaN NaN NaN NaN NaN NaN 0.228750 0.041821

1092 rows × 24 columns

In [13]:
# see how many rows display total dissolved nitrogen
data_mod4[data_mod4['Total Dissolved Nitrogen'].notna()]
Out[13]:
UID STATE_x DATE_COL_x WTBDY_NM NCCR_REG STATION_DEPTH ALAT_DD ALON_DD PROVINCE Ammonia ... Nitrate Nitrate/Nitrite Nitrite Nitrogen Particulate Phosphorus Particulate Total Dissolved Nitrogen Total Dissolved Phosphorus Total Kjeldahl Nitrogen Total Nitrogen Total Phosphorus
221 587 VA 7/13/2010 Warwick River Northeast 3.0 36.899760 -76.458730 Virginian Province 0.004 ... 0.0010 0.001 0.000 0.284 0.0493 0.216 0.033 NaN 0.500 0.0823
222 588 VA 7/13/2010 Lower James River Northeast 1.5 36.954960 -76.273370 Virginian Province 0.025 ... 0.0050 0.005 0.000 0.272 0.0353 0.194 0.045 NaN 0.466 0.0803
258 639 VA 7/8/2010 Back Bay Northeast 1.5 36.610230 -75.981980 Carolinian Province 0.009 ... 0.0020 0.002 0.001 0.877 0.0103 0.582 0.008 NaN 1.459 0.0183
397 819 VA 8/5/2010 Broad/Linkhorn Bay Northeast 0.9 36.890760 -76.070580 Virginian Province 0.010 ... 0.0080 0.008 0.000 0.619 0.0336 0.236 0.022 NaN 0.855 0.0556
398 820 VA 7/27/2010 Elizabeth River Northeast 10.4 36.769190 -76.296590 Virginian Province 0.008 ... 0.0240 0.146 0.122 0.277 0.0364 0.360 0.070 NaN 0.637 0.1064
399 822 VA 7/27/2010 Lower James River Northeast 15.6 36.880450 -76.335060 Virginian Province 0.008 ... 0.0070 0.013 0.006 0.240 0.0649 0.251 0.047 NaN 0.491 0.1119
659 1235 VA 7/21/2010 Pocomoke Sound Northeast 5.2 37.381930 -76.010570 Virginian Province 0.012 ... 0.0020 0.002 0.000 0.206 0.0313 0.180 0.010 NaN 0.386 0.0413
660 1236 VA 7/22/2010 Milford Haven Northeast 10.2 37.625233 -76.206816 Virginian Province 0.014 ... 0.0020 0.002 0.000 0.321 0.0317 0.212 0.012 NaN 0.533 0.0437
661 1237 VA 7/22/2010 Potomac River Northeast 10.0 37.678733 -76.262283 Virginian Province 0.011 ... 0.0020 0.002 0.000 0.270 0.0324 0.205 0.008 NaN 0.475 0.0404
662 1238 VA 7/22/2010 Pocomoke Sound Northeast 11.5 37.620250 -76.079017 Virginian Province 0.030 ... 0.0060 0.006 0.000 0.152 0.0362 0.218 0.010 NaN 0.370 0.0462
663 1239 VA 7/20/2010 Pocomoke Sound Northeast 5.8 37.115433 -76.019100 Virginian Province 0.022 ... 0.0030 0.003 0.000 0.120 0.0435 0.175 0.013 NaN 0.295 0.0565
664 1240 VA 7/21/2010 Milford Haven Northeast 2.7 37.443450 -76.239730 Virginian Province 0.012 ... 0.0040 0.004 0.000 0.211 0.0446 0.207 0.008 NaN 0.418 0.0526
665 1241 VA 7/20/2010 Pocomoke Sound Northeast 3.4 37.238600 -76.043500 Virginian Province 0.012 ... 0.0030 0.003 0.000 0.179 0.0540 0.185 0.009 NaN 0.364 0.0630
671 1251 VA 9/16/2010 Hog Island Bay Northeast 2.5 37.365020 -75.724020 Virginian Province 0.008 ... 0.0080 0.009 0.001 0.161 0.0900 0.218 0.035 NaN 0.379 0.1250
923 1794 VA 6/30/2010 Chickahominy River Northeast 3.5 37.293100 -76.893183 Virginian Province 0.007 ... 0.0000 0.000 0.000 0.184 0.0133 0.301 0.016 NaN 0.485 0.0293
924 1796 VA 8/25/2010 Pocomoke River Northeast 1.3 37.942600 -75.642600 Virginian Province 0.012 ... 0.0030 0.005 0.002 0.378 0.0251 0.565 0.069 NaN 0.943 0.0941
925 1797 VA 7/21/2010 Milford Haven Northeast 9.9 37.397550 -76.165530 Virginian Province 0.017 ... 0.0030 0.003 0.000 0.236 0.0351 0.188 0.010 NaN 0.424 0.0451
926 1798 VA 9/9/2010 Rappahannock River Northeast 3.7 37.579900 -76.385900 Virginian Province 0.026 ... 0.0051 0.067 0.062 0.184 0.0372 0.454 0.022 NaN 0.638 0.0592
927 1799 VA 7/22/2010 Pocomoke Sound Northeast 3.1 37.595380 -75.937333 Virginian Province 0.013 ... 0.0030 0.003 0.000 0.234 0.0417 0.208 0.008 NaN 0.442 0.0497
928 1800 VA 7/21/2010 York River Northeast 7.2 37.294017 -76.327500 Virginian Province 0.068 ... 0.0050 0.005 0.000 0.201 0.0715 0.260 0.012 NaN 0.461 0.0835
929 1801 VA 8/25/2010 Upper James River Northeast 6.3 37.262570 -76.981650 Virginian Province 0.025 ... 0.1450 0.229 0.084 0.123 0.0869 0.532 0.040 NaN 0.655 0.1269
942 1836 VA 8/5/2010 Upper James River Northeast 1.0 37.363930 -77.268250 Virginian Province 0.085 ... 0.2370 0.261 0.024 0.614 0.0944 0.697 0.021 NaN 1.311 0.1154
946 1856 VA 7/15/2010 York River Northeast 2.2 37.160067 -76.302633 Virginian Province 0.006 ... 0.0030 0.003 0.000 0.239 0.0514 0.228 0.015 NaN 0.467 0.0664

23 rows × 24 columns

In [14]:
# see which columns have mostly non NA entries
(data_mod4.notna().sum()/len(data_mod4)) > 0.9
Out[14]:
UID                               True
STATE_x                           True
DATE_COL_x                        True
WTBDY_NM                          True
NCCR_REG                          True
STATION_DEPTH                     True
ALAT_DD                           True
ALON_DD                           True
PROVINCE                          True
Ammonia                           True
Chlorophyll A                     True
Dissolved Inorganic Nitrogen      True
Dissolved Inorganic Phosphate     True
Dissolved Silica                 False
Nitrate                          False
Nitrate/Nitrite                   True
Nitrite                          False
Nitrogen Particulate             False
Phosphorus Particulate           False
Total Dissolved Nitrogen         False
Total Dissolved Phosphorus       False
Total Kjeldahl Nitrogen          False
Total Nitrogen                    True
Total Phosphorus                  True
dtype: bool
In [15]:
# keep only columns which have > 90% non NA entries
# cleaned data
data_mod5 = data_mod4[data_mod4.columns[(data_mod4.notna().sum()/len(data_mod4)) > 0.9]]
data = data_mod5.rename(
    columns = {
        'STATE_x':'State',
        'DATE_COL_x':'Date collected',
        'WTBDY_NM':'Waterbody name',
        'NCCR_REG':'Region',
        'STATION_DEPTH':'Water depth (in meters)',
        'ALAT_DD':'Latitude',
        'ALON_DD':'Longitude',
        'PROVINCE':'Province'
    }
)
data.head()
Out[15]:
UID State Date collected Waterbody name Region Water depth (in meters) Latitude Longitude Province Ammonia Chlorophyll A Dissolved Inorganic Nitrogen Dissolved Inorganic Phosphate Nitrate/Nitrite Total Nitrogen Total Phosphorus
0 59 CA 7/1/2010 Mission Bay West 2.5 32.77361 -117.21471 Californian Province 0.000 3.34 0.014 0.028 0.014 0.40750 0.061254
1 60 CA 7/1/2010 San Diego Bay West 3.5 32.71424 -117.23527 Californian Province 0.010 2.45 0.020 0.026 0.010 0.23000 0.037379
2 61 CA 7/1/2010 Mission Bay West 2.2 32.78372 -117.22132 Californian Province 0.000 3.82 0.009 0.030 0.009 0.33625 0.048100
3 62 CA 7/1/2010 San Diego Bay West 9.5 32.72245 -117.20443 Californian Province 0.000 6.13 0.010 0.028 0.010 0.23875 0.044251
4 63 NC 6/9/2010 White Oak River Southeast 1.0 34.75098 -77.12117 Carolinian Province 0.002 9.79 0.030 0.043 0.028 0.63250 0.090636
In [16]:
# exported data as a separate .csv file to read 
data_csv = data.to_csv('data', index=False)

2¶

What is the apparent relationship between nutrient availability and productivity?¶

In [17]:
alt.data_transformers.disable_max_rows()
Out[17]:
DataTransformerRegistry.enable('default')
In [18]:
data.head(1)
Out[18]:
UID State Date collected Waterbody name Region Water depth (in meters) Latitude Longitude Province Ammonia Chlorophyll A Dissolved Inorganic Nitrogen Dissolved Inorganic Phosphate Nitrate/Nitrite Total Nitrogen Total Phosphorus
0 59 CA 7/1/2010 Mission Bay West 2.5 32.77361 -117.21471 Californian Province 0.0 3.34 0.014 0.028 0.014 0.4075 0.061254
In [19]:
# df for scatter

scatter_df = data.melt(
    id_vars = ['UID', 'State', 'Date collected',
               'Waterbody name', 'Region', 'Water depth (in meters)',
               'Latitude', 'Longitude', 'Chlorophyll A', 'Province'],
    var_name = 'Nutrient',
    value_name = 'Level'
)
In [20]:
scatter_df.head()
Out[20]:
UID State Date collected Waterbody name Region Water depth (in meters) Latitude Longitude Chlorophyll A Province Nutrient Level
0 59 CA 7/1/2010 Mission Bay West 2.5 32.77361 -117.21471 3.34 Californian Province Ammonia 0.000
1 60 CA 7/1/2010 San Diego Bay West 3.5 32.71424 -117.23527 2.45 Californian Province Ammonia 0.010
2 61 CA 7/1/2010 Mission Bay West 2.2 32.78372 -117.22132 3.82 Californian Province Ammonia 0.000
3 62 CA 7/1/2010 San Diego Bay West 9.5 32.72245 -117.20443 6.13 Californian Province Ammonia 0.000
4 63 NC 6/9/2010 White Oak River Southeast 1.0 34.75098 -77.12117 9.79 Carolinian Province Ammonia 0.002
In [21]:
# panel for ammonia
scatter_panel_ammonia = alt.Chart(scatter_df).mark_circle(opacity = 0.2).encode(
    x = alt.X('Ammonia:Q', scale = alt.Scale(zero = True), title = ''),
    y = alt.Y('Level', scale = alt.Scale(zero = True), title = '')
).properties(
    width = 150, 
    height = 150
).facet(
    column = alt.Column('Nutrient', title = 'Ammonia Levels mg N/L')
).resolve_scale(x = 'independent', y = 'independent')

# panel for chlorphyll a
scatter_panel_Chl = alt.Chart(scatter_df).mark_circle(opacity = 0.2).encode(
    x = alt.X('Chlorophyll A', scale = alt.Scale(zero = True), title = ''),
    y = alt.Y('Level', scale = alt.Scale(zero = True), title = '')
).properties(
    width = 150, 
    height = 150
).facet(
    column = alt.Column('Nutrient', title = 'Chlorophyll A Levels ug/L')
).resolve_scale(x = 'independent', y = 'independent')
In [22]:
scatter_panel_Chl
Out[22]:
In [23]:
# scatter for phosphorus against nitrogen
scatter_panel_P_N = alt.Chart(data).mark_circle(opacity = 0.2).encode(
    x = alt.X('Total Phosphorus', scale = alt.Scale(zero = True)),
    y = alt.Y('Total Nitrogen', scale = alt.Scale(zero = True))
).properties(
    width = 150, 
    height = 150
)
In [24]:
scatter_panel_P_N
Out[24]:
In [29]:
# data frame with only quantitative variables
x_mx = data.iloc[:, 8:16].drop(columns = 'Province')

# long form dataframe for plotting panel
scatter_df_long = x_mx.melt(
    var_name = 'row',
    value_name = 'row_index'
).join(
    pd.concat([x_mx, x_mx, x_mx, x_mx, x_mx, 
               x_mx, x_mx], axis = 0).reset_index(),
).drop(
    columns = 'index'
).melt(
    id_vars = ['row', 'row_index'],
    var_name = 'col',
    value_name = 'col_index'
)

# panel
scatter_panel = alt.Chart(scatter_df_long).mark_point(opacity = 0.4).encode(
    x = alt.X('row_index', scale = alt.Scale(zero = False), title = ''),
    y = alt.Y('col_index', scale = alt.Scale(zero = False), title = '')
).properties(
    width = 150, 
    height = 75
).facet(
    column = alt.Column('col', title = ''),
    row = alt.Row('row', title = '')
).resolve_scale(x = 'independent', y = 'independent')
In [30]:
# Pairwise relationship for plotting panel
scatter_panel
Out[30]:
In [31]:
# save above figure as html to export in final report
scatter_panel.save('variance_scatter.html')
In [32]:
# Correlation matrix for just nutrients and productivity
x_mx.corr()
Out[32]:
Ammonia Chlorophyll A Dissolved Inorganic Nitrogen Dissolved Inorganic Phosphate Nitrate/Nitrite Total Nitrogen Total Phosphorus
Ammonia 1.000000 0.076214 0.223906 0.373070 0.128686 0.288228 0.321642
Chlorophyll A 0.076214 1.000000 0.188035 0.196624 0.185112 0.641165 0.512931
Dissolved Inorganic Nitrogen 0.223906 0.188035 1.000000 0.258240 0.995142 0.716507 0.234987
Dissolved Inorganic Phosphate 0.373070 0.196624 0.258240 1.000000 0.224840 0.378746 0.807155
Nitrate/Nitrite 0.128686 0.185112 0.995142 0.224840 1.000000 0.700950 0.206868
Total Nitrogen 0.288228 0.641165 0.716507 0.378746 0.700950 1.000000 0.566093
Total Phosphorus 0.321642 0.512931 0.234987 0.807155 0.206868 0.566093 1.000000
In [33]:
# correlation matrix for all quantitative variables
data.iloc[:,2:16].corr()
Out[33]:
Water depth (in meters) Latitude Longitude Ammonia Chlorophyll A Dissolved Inorganic Nitrogen Dissolved Inorganic Phosphate Nitrate/Nitrite Total Nitrogen Total Phosphorus
Water depth (in meters) 1.000000 0.306774 -0.078211 -0.122657 -0.144925 -0.025702 -0.141074 -0.014072 -0.215607 -0.184730
Latitude 0.306774 1.000000 -0.020342 -0.100746 -0.241976 0.102828 -0.325458 0.114923 -0.279980 -0.416150
Longitude -0.078211 -0.020342 1.000000 -0.042832 0.053520 -0.026439 -0.282131 -0.021377 0.027846 -0.181539
Ammonia -0.122657 -0.100746 -0.042832 1.000000 0.076214 0.223906 0.373070 0.128686 0.288228 0.321642
Chlorophyll A -0.144925 -0.241976 0.053520 0.076214 1.000000 0.188035 0.196624 0.185112 0.641165 0.512931
Dissolved Inorganic Nitrogen -0.025702 0.102828 -0.026439 0.223906 0.188035 1.000000 0.258240 0.995142 0.716507 0.234987
Dissolved Inorganic Phosphate -0.141074 -0.325458 -0.282131 0.373070 0.196624 0.258240 1.000000 0.224840 0.378746 0.807155
Nitrate/Nitrite -0.014072 0.114923 -0.021377 0.128686 0.185112 0.995142 0.224840 1.000000 0.700950 0.206868
Total Nitrogen -0.215607 -0.279980 0.027846 0.288228 0.641165 0.716507 0.378746 0.700950 1.000000 0.566093
Total Phosphorus -0.184730 -0.416150 -0.181539 0.321642 0.512931 0.234987 0.807155 0.206868 0.566093 1.000000
In [34]:
# store correlation matrix
corr_mx = x_mx.corr()

# melt to long form
corr_mx_long = corr_mx.reset_index().rename(
    columns = {'': 'row'}
).melt(
    id_vars = 'row',
    var_name = 'col',
    value_name = 'Correlation'
)

# visualize as heatmap
heatmap = alt.Chart(corr_mx_long).mark_rect().encode(
    x = alt.X('col', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    y = alt.Y('row', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    color = alt.Color('Correlation', 
                      scale = alt.Scale(scheme = 'blueorange',
                                        domain = (-1, 1), 
                                        type = 'sqrt'),
                     legend = alt.Legend(tickCount = 5))
).properties(width = 200, height = 200)
In [35]:
heatmap
Out[35]:
In [36]:
corr_mx
Out[36]:
Ammonia Chlorophyll A Dissolved Inorganic Nitrogen Dissolved Inorganic Phosphate Nitrate/Nitrite Total Nitrogen Total Phosphorus
Ammonia 1.000000 0.076214 0.223906 0.373070 0.128686 0.288228 0.321642
Chlorophyll A 0.076214 1.000000 0.188035 0.196624 0.185112 0.641165 0.512931
Dissolved Inorganic Nitrogen 0.223906 0.188035 1.000000 0.258240 0.995142 0.716507 0.234987
Dissolved Inorganic Phosphate 0.373070 0.196624 0.258240 1.000000 0.224840 0.378746 0.807155
Nitrate/Nitrite 0.128686 0.185112 0.995142 0.224840 1.000000 0.700950 0.206868
Total Nitrogen 0.288228 0.641165 0.716507 0.378746 0.700950 1.000000 0.566093
Total Phosphorus 0.321642 0.512931 0.234987 0.807155 0.206868 0.566093 1.000000
In [37]:
# save heatmap as html for export
heatmap.save('heatmap.html')

Are there any notable differences in available nutrients among U.S. coastal regions?¶

In [38]:
# creating boxplots to see distribution across regions

# nitrogen boxplot
N_box = alt.Chart(scatter_df[scatter_df['Nutrient'] == 'Total Nitrogen']).mark_boxplot(
    size = 50
).encode(
    x = alt.X('Region', title = 'Total Nitrogen'),
    y = alt.Y('Level'),
    color = alt.Color('Region')
).properties(
    width = 300, 
    height = 250
)

# phosphorus boxplot
P_box = alt.Chart(scatter_df[scatter_df['Nutrient'] == 'Total Phosphorus']).mark_boxplot(
    size = 50
).encode(
    x = alt.X('Region', title = 'Total Phosphorus',
              scale = alt.Scale(zero = False)),
    y = alt.Y('Level'),
    color = alt.Color('Region')
).properties(
    width = 300, 
    height = 250
)

# ammonia boxplot
NH3_box = alt.Chart(scatter_df[scatter_df['Nutrient'] == 'Ammonia']).mark_boxplot(
    size = 50
).encode(
    x = alt.X('Region', title = 'Ammonia',
              scale = alt.Scale(zero = False)),
    y = alt.Y('Level'),
    color = alt.Color('Region')
).properties(
    width = 300, 
    height = 250
)
In [39]:
total_region_nutrients = N_box & P_box & NH3_box
total_region_nutrients
Out[39]:
In [40]:
total_region_nutrients.save('total_region_nutrients.html')
In [41]:
scatter_df.head()
Out[41]:
UID State Date collected Waterbody name Region Water depth (in meters) Latitude Longitude Chlorophyll A Province Nutrient Level
0 59 CA 7/1/2010 Mission Bay West 2.5 32.77361 -117.21471 3.34 Californian Province Ammonia 0.000
1 60 CA 7/1/2010 San Diego Bay West 3.5 32.71424 -117.23527 2.45 Californian Province Ammonia 0.010
2 61 CA 7/1/2010 Mission Bay West 2.2 32.78372 -117.22132 3.82 Californian Province Ammonia 0.000
3 62 CA 7/1/2010 San Diego Bay West 9.5 32.72245 -117.20443 6.13 Californian Province Ammonia 0.000
4 63 NC 6/9/2010 White Oak River Southeast 1.0 34.75098 -77.12117 9.79 Carolinian Province Ammonia 0.002

Based on the 2010 data, does productivity seem to vary geographically in some way?¶

If so, explain how; If not, explain what options you considered and ruled out.

In [42]:
# chlorophyll boxplot
Chl_box = alt.Chart(scatter_df).mark_boxplot(
    size = 50
).encode(
    x = alt.X('Region', title = ''),
    y = alt.Y('Chlorophyll A'),
    color = alt.Color('Region')
).properties(
    width = 300, 
    height = 250
)

Chl_box
Out[42]:
In [43]:
Chl_box.save('chl_box.html')
In [44]:
scatter_df.head(1)
Out[44]:
UID State Date collected Waterbody name Region Water depth (in meters) Latitude Longitude Chlorophyll A Province Nutrient Level
0 59 CA 7/1/2010 Mission Bay West 2.5 32.77361 -117.21471 3.34 Californian Province Ammonia 0.0

How does primary productivity in California coastal waters change seasonally in 2010, if at all?¶

Does your result make intuitive sense?

In [45]:
# plotting date collected vs. chlorophyll a levels in california with categorized waterbody names
ca_scatter = alt.Chart(scatter_df[scatter_df['State'] == 'CA']).mark_point(
).encode(
    x = 'Date collected:T',
    y = 'Chlorophyll A',
    color = 'Waterbody name'
)
In [46]:
# adding regression line

ca_total = ca_scatter + ca_scatter.transform_regression('Date collected', 
                                                        'Chlorophyll A',
                                                       method = 'poly',
                                                       order = 3).mark_line()
In [47]:
ca_total
Out[47]:
In [48]:
# save figure as exportable html
ca_total.save('ca_total.html')

Pose and answer one additional question: What's the relationship between water depth and productivity?¶

In [49]:
# create plots for water depth vs. chlorophyll a levels by region
depth_scatter_facet = alt.Chart(scatter_df).mark_circle(
    opacity = 0.4
).encode(
    x = alt.X('Water depth (in meters)',
              scale = alt.Scale(zero = True)),
    y = alt.Y('Chlorophyll A',
              scale = alt.Scale(zero = True))
).properties(
    width = 400,
    height = 200
).facet(
    row = 'Region'
).resolve_scale(x = 'independent', y = 'independent')
depth_scatter_facet
Out[49]:
In [50]:
# view outliers

data[data['Water depth (in meters)'] > 180]
Out[50]:
UID State Date collected Waterbody name Region Water depth (in meters) Latitude Longitude Province Ammonia Chlorophyll A Dissolved Inorganic Nitrogen Dissolved Inorganic Phosphate Nitrate/Nitrite Total Nitrogen Total Phosphorus
118 422 WA 7/10/2010 Puget Sound West 185.0 47.80405 -122.457533 Columbian Province 0.001 6.26 0.009 0.010 0.008 0.24125 0.041830
245 624 WA 7/7/2010 Puget Sound West 198.0 47.39325 -122.344383 Columbian Province 0.011 17.90 0.023 0.009 0.012 0.35625 0.061139
In [51]:
# non-faceted plot of water depth vs. chlorophyll a levels
depth_scatter = alt.Chart(scatter_df).mark_circle(
    opacity = 0.4
).encode(
    x = alt.X('Water depth (in meters)',
              scale = alt.Scale(zero = True)),
    y = alt.Y('Chlorophyll A',
              scale = alt.Scale(zero = True))
).properties(
    width = 400,
    height = 200
)
depth_scatter
Out[51]: